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Abstract 

We study a high order finite difference scheme to 
solve the time accurate flow field of a jet using the com- 
pressible Navier-Stokes equations. As part of our ongo- 
ing efforts, we have implemented our numerical model on 
three parallel computing platforms to study the compu- 
tational, communication, and scalability characteristics. 
The platforms chosen for this study are a cluster of work- 
stations connected through fast networks (the LACE ex- 
perimental testbed at NASA Lewis), a shared memory 
multiprocessor (the Cray YMP), and a distributed mem- 
ory multiprocessor (the IBM SP1). Our focus in this 
study is on the LACE testbed. We present some re- 
sults for the Cray YMP and the IBM SP1 mainly for 
comparison purposes. On the LACE testbed, we study: 
(a) the communication characteristics of Ethernet, FDDI 
and the ALLNODE networks and (b) the overheads 
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induced by the PVM message passing library used for 
parallelising the application. We demonstrate that clus- 
tering of workstations is effective and has the potential 
to be computationally competitive with supercomputers 
at a fraction of the cost. 

Key words: Navier-Stokes computations, parallel 
computing, cluster computing. 


1. Introduction 

Numerical simulations can play an important role in 
examining the physical processes associated with many 
important practical problems. The suppression of jet ex- 
haust noise is one such problem which will have a great 
impact on the success of the High Speed Civil Transport 
plane (HSCT). The radiated sound emanating from the 
jet can be computed by solving the full (time-dependent) 
compressible Navier-Stokes equations. This computation 
can, however, be very expensive and time consuming. 
The difficulty can be partially overcome by limiting the 
solution domain to the near field where the jet is nonlin- 
ear and then using acoustic analogy [e.g., LighthilU] to 
relate the fat-field noise to the near-field sources. The lat- 
ter technique requires obtaining the time-dependent flow 
field. In this study we concentrate on such flow fields 
near the nozzle exit. 
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The advent of massively parallel processors and clus- 
tered workstations gives the opportunity for scientists to 
parallelise their computationally intensive codes and re- 
duce turnaround time. Recognising this, a number of 
researchers^ ® ^ ® have studied CFD applications on spe- 
cific parallel architectures. Our goal is to implement a 
particular CFD application code which uses a high or- 
der finite difference scheme on a variety of parallel com- 
puting platforms to study the computational, commu- 
nication, and scalability characteristics. The platforms 
chosen for this study are a cluster of workstations con- 
nected through fast networks (the LACE** experimental 
testbed at NASA Lewis), a shared memory multiproces- 
sor (the Cray YMP), and a distributed memory multi- 
processor (the IBM SP1). Clustered workstations such 
as the LACE are becoming increasingly popular because 
they show promise as a low cost alternative to expensive 
supercomputers and massively parallel processors. We 
have followed a pragmatic approach by parallelising only 
those segments of the algorithm which are computation- 
ally intensive (identified using a profiler). Such an ap- 
proach can reduce the time for parallelizing “dusty deck” 
sequential codes. Our distributed memory implementa- 
tion on the LACE uses PVM^, a portable message passing 
library developed at the Oak Ridge National Laboratory. 

In the next two sections we briefly discuss the gov- 
erning equations and the numerical model. Section 4 
has a discussion of the parallel architectures used in the 
study and the parallelization schemes. In Section 5 we 
present an analytical model for computation and com- 
munication costs and discuss simulation results on the 
three platforms used in this study. 

2. Governing Equations 

We solve the Navier-Stokes and the Euler equations 
to compute flow fields of an axisymmetric jet. The 
Navier-Stokes equations for such flows in polar coordi- 
nates can be written as 
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F and G are the fluxes in the x and r directions respec- 
tively, and S is the source term that arises in the cylindri- 
cal polar coordinates, and are the shear stresses. One 

obtains the Euler equations from the above equations, by 
setting k and all equal to zero. 

3. Numerical Model 

We use the fourth order MacCormack scheme, due 
to Gottlieb and Turkel®, to solve the Navier-Stokes and 
the Euler equations. For the present computations, the 
operator L in the equation L Q = S or equivalently Qt 4 
F z 4 G r = S is split into two one-dimensional operators 
and the scheme is applied to these split operators. For 
the one dimension model/split equation Qt + F z — 5, we 
express the predictor step with forward differences as 

Qx = Qi + 6^{W+1 - F T) - (*?+: 2 - f? + 1)} + A tSi 

and the corrector step with backward differences as 

Q? +l = \[Qi + Q? 

~ — (F*- 1 ~ ^>- 2 )} + At5<] 

This scheme becomes fourth-order accurate in the spatial 
derivatives when alternated with symmetric variants 
We also use the standard MacCormack scheme in our 
tests. This scheme is second order accurate in both space 
and time and can be written as 


where 
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We define L\ as a one dimensional operator with a for- 
ward difference in the predictor and a backward differ- 
ence in the corrector. Its symmetric variant Li uses a 
backward difference in the predictor and a forward dif- 
ference in the corrector. For our computations, the one 
dimensional sweeps are arranged as 
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Q n+1 = L u L lr Q n 
Q n+2 = L 7r L 2t Q n+1 

This scheme is used for the interior points. In or- 
der to advance the scheme near boundaries the fluxes are 
extrapolated outside the domain to artificial points us- 
ing a cubic extrapolation to compute the solution on the 
boundary. At the outflow boundary, we then solve the 
following set of equation to get the solution at the new 
time for all boundary points. 

Pt - pcu t = 0 

Pt + pcu t = R 2 

Pt — C 2 Pt = i?3 

v t = R* 

where Ri is determined by which variables are specified 
and which are not. Whenever, the combination is not 
specified, Ri is just those spatial derivatives that come 
from the Navier-Stokes equations. Thus, Ri contains vis- 
cous contributions even though the basic format is based 
on in viscid characteristic theory. This framework of out- 
flow boundary condition implementation is discussed by 
Hayder and Turkel^. Due to the presence of a singular- 
ity, special care is needed to treat the centerline. Here, 
we briefly outline one of the formulations for the cen- 
terline used in three dimensional calculations. In this 
treatment, equations corresponding to the Cartesian co- 
ordinates are solved at the centerline. Away from the 
centerline variables are solved in polar coordinates. We 
use r 2 = a^+y 2 , tanff = y/z, x = rcosB and y = rsinO to 
derive the equations at the centerline. The conservation 
equation for the Cartesian variables is 

Wt + F x + Gy + H £ = 0 

or 

+ {Fyo-Gxo^ + (Gz T -F y r )e +Hi = Q 
r r 

or 

( rW) t + (vF) T + G e + (rS) z = 0 

where F = FcosB + GsinO and G = GcosO — FsinO 
Because both F and G are single valued, Fg and G$ are 
zero at the centerline. Then using the L’Hospital rule at 
r=0 we get 

W t + (F + G e ) T cos9 + (G - Fg) r sinB + H z = 0 

The above equation is solved at the centerline in three 
dimensional calculations. For axisym metric calculations, 


we have options to use the L ’Hospital rule or extrapola- 
tions to get the quantities at the centerline. Our numer- 
ical model is three dimensional. However, to maintain 
low computational costs, in this study we obtained only 
two dimensional results. Further discussion of our nu- 
merical model is given in Hayder et al*^ and Mankbadi 
et al.ll 


4. Parallel Implementations 


4.1 Network of Workstations 

The LACE (Lewis Advanced Cluster Environment) 
is a cluster of 33 IBM RS6000 workstations (nodeO — 
node32). Node 0 is the file server. All the nodes are con- 
nected through two Ethernet networks (10 Mbits/sec). 
In addition, nodes 00—16 are connected through the much 
faster FDDI interface (100 Mbits/sec) and nodes 17— 
32 are connected through IBM’s ALLNODE prototype 
switch (capable of a peak throughput of 512 Mbits/sec). 
The ALLNODE switch is an Omega interconnection net- 
work capable of providing multiple contentionless paths 
between the nodes of the cluster. 

Communication among the processors is through 
message passing . We used the popular PVM (Parallel 
Virtual Machine) message passing library (version 3.2.2) 
to implement our parallel program. We partitioned the 
domain along the z direction only. Our partitioning 
scheme was static since there was no explicit need to keep 
long vectors given the architecture of the RS6000. The 
LACE test bed is upgraded periodically. Our results are 
for RS6000/560 processors. We will report results with 
upgraded hardware and software in subsequent studies. 


4.2 Shared Memory Architecture 

We used the Cray YMP/8, which has eight vector 
processors, for this study. The Cray YMP/8 has a peak 
rating of approximately 2.7 GigaFLOPS. It offers a single 
address space and the communication between processes 
executing on different processors is through shared vari- 
ables. Using DOALL directives, we partitioned the do- 
main along the orthogonal direction of the sweep to keep 
the vectors lengths large and to avoid non-stride access 
for most of the variables. 
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4.3 Distributed Memory Multiprocessor 

In this study, we also used the IBM SP1 which is a 
distributed memory, message passing multiprocessor. We 
would like to point out that our test bed is the machine 
at NASA Lewis in which each processor is a RS6000/370. 
This system has the original SP1 processors and software 
upgrades for SP2. The nodes of the SP1 are intercon- 
nected through a variation of the Omega network. This 
network permits multiple contentionless paths between 
nodes like the ALLNODE switch. We used the MPL li- 
brary for parallel implementation of our numerical model 
on the SP1. 

5. Results 

5.1 Analytical Model 

We will first provide a model to estimate the compu- 
tation and communication costs. Note that in one time 
step there is one corrector and one predictor step in each 
of the coordinate directions. Each of the predictor and 
corrector steps has a forward and a backward sweep. Let 
us examine the communication cost for two dimensional 
computations with P z partitions in the z direction. We 
divide the cost of data communications across partition 
boundaries into two parts, C $ and Cf. C, is associated 
with the communication needed to compute stresses or 
heat fluxes at the boundary points. This will be zero for 
the Euler calculations. Cf is associated with the transfer 
of fluxes at the domain boundaries, as required by the 
forward and the backward sweeps. This would be same 
for both the Euler and the Navier- Stokes calculations. 
For this partitioning, we get 

C,,z = 2 n $ [N r t t +f # ] 

Cf,z = n,[. S c N r tt +t # ] 

C t1 r = 2(n, - l)[N r it+U] 

C fy r= 0 

where n s is the number of neighboring variables needed 
for stress and heat flux computations. Subscripts z and 
r in the above set of equations refer the direction of the 
sweep. While velocities at neighboring points in both 
directions are needed for stress computations, temper- 
atures in only the sweep direction are needed for heat 
flux computations. Thus, for two dimensional computa- 
tions, n 9 is three in the sweep direction and two in the 
other direction. Let N r be the number of grid points in 
the radial direction, i, and ft be the startup and vector 
transfer rates (time per element) respectively, n* be the 
number of conservation variables (4 in two dimensions), 
and S c be the width of the stencil measured from the 
point where variables are updated. This is one for the 


standard (second order) MacCormack scheme and 2 for 
the fourth order extension. S c would increase for higher 
order schemes. The coefficient two in above equations 
for C 9 comes from the fact that data exchanges in both 
directions across the partition boundary are needed for 
stress computations. On the contrary, since the algo- 
rithm uses one sided differences of fluxes, the communi- 
cations of fluxes are only one direction at any time. The 
total communication times for one time step per proces- 
sor is then 

Tcm = 2[(4n, - 2 ){N r i t +i t ) + n.(S e lM, + 1,)} 

This is expected to rise as the surface area in the parti- 
tions increase. It is easily seen that the total computation 
time per processor is 

Top = ^N r T c 

where T c is the computation time per grid point and N z 
is the total number of grid points in the z direction. It 
should be noted that as the size of the problem changes, 
T c could also change based on the architecture of the ma- 
chine. As the size of the problem increases, there will be 
an increased amount of cache transfers and misses. This 
would increase the overhead associated with the compu- 
tation. The total time spent per processor is then 

T<T CTn -f Tc 

The inequality is a result of some overlapped computa- 
tions and communications. As the number of processor 
(P t ) increases, Tcm will become significant compared to 

Tcp. 

5.2 Numerical results 

We consider a jet with the mean inflow profile 

Ur = tfoo + (U c - Do.)* 

Tr = T e + (Too - Tc) 9t + - g T )g T 

g T = l[l+tanh(!—^)} 

where 0 is the momentum thickness. The subscripts c 
and oo refer to the centerline and free stream values re- 
spectively. At inflow, we assume the radial velocity is 
zero and the static pressure is constant. The standard 
size of our computational domain is 50 radii in the axial 
direction and 5 radii in the radial direction. We excite 
the inflow profile at location r and time t as 

U(r,t)=U(r) + €Re(Ue i * s * t ) 
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P(r,t) = P(r) + eRe(Pe irStt ) 

P(M) = P( r ) +c-Rc(pe’ x5 **) 

V(r,f) = eflefVV' 5 **) 

U, Vy p and P are the eigenfunctions of the linearized 
equations with the same mean flow profile, e is the exci- 
tation level and S t is the Strouhal number. 

We consider a case with ^ ^ \ , momen- 

tum thickness,# = | and Strouhal number, S t = The 
jet center Mach number is 1.5 while the Reynolds num- 
ber based on the jet diameter is 1.2 million. Our present 
mean flow and eigen function profiles are same as those 
in Scott et al. ^ In Figure 1 we show a contour plot of 
axial momentum from the solution of the Navier Stokes 
equations. A grid of size 100x250 was used in this com- 
putation. This result was obtained after about 16,000 
time steps. For all other results in this paper, we simu- 
lated the same physical problem on the same grid, but 
ran for 5000 time steps. 

We optimized our code on the LACE to reduce com- 
putation time. In the process of optimization, we devel- 
oped several versions of our code. We will refer the orig- 
inal version of the code as version- 1. In version-2, we 
replaced exponentiations in the code by multiplications 
(i.e. o 2 is replaced by a times a, etc.). Then we rear- 
ranged dimensions of the arrays to ensure stride- 1 access 
whenever possible. We call the modified code version- 
3. We also reduced number of divisions and used a sin- 
gle COMMON block instead of many COMMON blocks. 
We will refer these versions as version-4 and version-5 re- 
spectively. In Figure 2 we compare the execution times of 
different versions of our numerical models on a single pro- 
cessor. It can be easily noted that Stride- 1 access was the 
single most important optimization and it reduced the 
computation time considerably. Our partitioning scheme 
resulted in a good load balance among the processors. In 
Figure 3, we show the computation times in different pro- 
cessors for our solutions of the Navier- Stokes equations 
on sixteen processors of the IBM SP1. Results shown 
in this figure include the computation time and the exe- 
cution times for the MPL library calls, but exclude any 
communication wait or transfer time. To study commu- 
nication characteristics in distributed computing, we de- 
veloped two additional versions of our code. In version-6 
some computations are overlapped with communications. 
And finally, we divided interprocessor messages in shorter 
packets to avoid bursty traffic. This is our version-7. In 
Figure 4 we compare execution times on multiple pro- 
cessors connected by different networks. It is noted that 
as the number of processor increases, performance of the 
cluster degrades for the Ethernet due to traffic conges- 
tions. For large numbers of processors, in fact execution 
time increases as the number of processors increases. The 


break even point for the Ethernet was about 8-10 proces- 
sors. Both the ALLNODE switch and the FDDI network 
show sublinear speedups. We observe that the high speed 
of FDDI is balanced by slower multiple contention free 
links in the ALLNODE switch. In Figure 5, we analyze 
the communication characteristics of different networks. 
The communication time we report is really the sum of 
the actual communication time, which is not overlapped 
with the computation time, and the waiting time by a 
processor for messages. Communication times for the 
ALLNODE and the FDDI did not change very much for 
the number of processors used in this study. On the other 
hand, Ethernet shows a significant increase of communi- 
cation times with a large number of processors. For the 
Navier-Stokes computations with the ALLNODE switch 
with 16 processors, communication time is comparable 
to the computation and PVM setup time (PVM set up 
time includes execution times for all PVM library calls, 
including packing and unpacking. This time does not in- 
clude any communication wait and transfer time). Rel- 
ative communication times are lower for the Euler com- 
putations. In Figure 6, we show similar results for the 
standard (second order) MacCormack scheme. Since the 
stencil size is now three points wide in any particular 
direction, instead of five points for the fourth order ex- 
tension, there is less communication. As a result, Ether- 
net performed relatively better. However, we note that 
communication times are comparable to the computation 
and PVM setup time for more than eight processors. We 
should point out that all our results in the study except 
those in Figure 6, were obtained with using the fourth 
order MacCormack scheme. 

In version-6, we tried to optimize our numerical 
model further by doing some computations overlapped 
with interprocessor communications. Such an approach 
should reduce communication times. However, this mod- 
ification also increases the complexity of the numerical 
model and overhead associated with loop setups. Our 
results with overlapped computation and communication 
along with the results with version-7 are shown in Figure 
7. We observe that overlapped computation and com- 
munication did not improve the performance of our nu- 
merical model. With shorter packet size (in version- 7), 
the Ethernet performed better, while we notice degrada- 
tion for simulations using the ALLNODE switch. The 
apparent ineffectiveness of the optimizations in versions 
6 and 7 is likely due to the nature of communication re- 
quirements of our numerical model, i.e., our algorithm. 
We used our numerical model of version 4 to obtain ex- 
ecution times on the IBM-SP1. As indicated earlier, we 
used the MPL library for our parallel implementations. 
Our results are shown in Figure 8. For 16 processors, 
the speed up was about 13 for the Euler and 14 for the 
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Navier-Stokes computations. We would like to point out 
that processors in the SP1 are not exactly same as those 
in the LACE cluster. As shown in this figure, communi- 
cation times were negligible compared to execution times. 
Because of the fast communication, we expect to see sig- 
nificant speedups with additional processors. 

On the Cray- YMP, initially we ran the model with 
only the Autotasking option on and obtained speedups of 
slightly above one. We then identified the computation- 
ally intensive portions of the code using a profiler, per- 
formed subroutine inlining, and partitioned data struc- 
tures explicitly into shared and private variables. With 
these modifications, using eight processors, we obtained 
speedups of more than 6 for both the Euler and the 
Navier Stokes computations. Our results on the Cray 
YMP are shown in Figure 9. 

We compare results of three computing platforms in 
Figure 10.1 and 10.2. For the LACE cluster, we show 
results with the ALLNODE switch. Execution times in 
both the LACE cluster and the IBM SP1 are likely to 
go down with additional processors. Because of the fast 
network, speed ups for the IBM SPl are likely to be 
higher. The LACE cluster with 16 processors is about 
one and a half to two times slower than a single proces- 
sor Cray YMP. The performance of the IBM SPl appear 
to be similar. Both the LACE cluster and the IBM SPl 
are going through upgrades and performances are going 
to improve. Parallel computers, specially the network 
of workstations are becoming viable and cost effective 
alternatives to traditional supercomputing. For cluster 
computing, one should try to use a fast network. Oth- 
erwise the communication times can become significant 
and in fact the execution time can rise as processors are 
added beyond a certain number. In our tests with large 
number of processors, the ALLNODE performed well and 
the FDDI was also comparable. On the other hand, the 
Ethernet performed poorly for such cases. We plan to 
follow up our present work with studies of other parallel 
computing platforms including the Cray T3D and exam- 
inations of other numerical models. 
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Figure 5.1: Navier Stokes computations on the LACE 
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